7 research outputs found

    Linear-time Online Action Detection From 3D Skeletal Data Using Bags of Gesturelets

    Full text link
    A sliding window is one direct way to extend a successful recognition system to the more challenging detection problem. While action recognition decides only whether or not an action is present in a pre-segmented video sequence, action detection identifies the time interval where the action occurred in an unsegmented video stream. Sliding-window approaches to action detection can, however, be slow, as they maximize a classifier score over all possible sub-intervals. Even though newer schemes use dynamic programming to speed up the search for the optimal sub-interval, they require offline processing of the whole video sequence. In this paper, we propose a novel approach for online action detection based on 3D skeleton sequences extracted from depth data. It identifies the sub-interval with the maximum classifier score in linear time. Furthermore, it is invariant to temporal scale variations and is suitable for real-time applications with low latency.
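    The linear-time search described above can be illustrated with a maximum-sum sub-interval scan over per-frame classifier scores (a Kadane-style single pass). The sketch below is only an illustration under that assumption; the function name, state layout, and toy scores are not from the paper.

        # Illustrative sketch (not the paper's implementation): given per-frame
        # classifier scores, find the contiguous sub-interval with the maximum
        # total score in one online pass, so each new frame costs O(1) time.
        def update_best_interval(score, state=None):
            """Consume one per-frame score and return the updated search state.

            state = (best_sum, best_start, best_end, cur_sum, cur_start, frame_index)
            """
            if state is None:
                return (score, 0, 0, score, 0, 0)
            best_sum, best_start, best_end, cur_sum, cur_start, t = state
            t += 1
            # Either extend the current candidate interval or restart it at frame t.
            if cur_sum + score < score:
                cur_sum, cur_start = score, t
            else:
                cur_sum += score
            # Keep track of the best-scoring interval seen so far.
            if cur_sum > best_sum:
                best_sum, best_start, best_end = cur_sum, cur_start, t
            return (best_sum, best_start, best_end, cur_sum, cur_start, t)

        # Usage: feed frame scores as they arrive from the skeleton-based classifier.
        state = None
        for s in [-0.2, 0.5, 1.1, -0.3, 0.9, -2.0, 0.1]:  # toy per-frame scores
            state = update_best_interval(s, state)
        print(state[:3])  # -> best interval covers frames 1..4 with total score ~2.2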

    Neural Rendering Techniques for Photo-realistic Image Generation and Novel View Synthesis

    Get PDF
    Recent advances in deep generative models have enabled computers to imagine and generate fictional images from any given distribution of images. Techniques like Generative Adversarial Networks (GANs) and image-to-image (I2I) translation can generate images by mapping random noise or an input image (e.g., a sketch or a semantic map) to photo-realistic images. However, there are still plenty of challenges in training such models and improving their output quality and diversity. Furthermore, to harness this imaginative and generative power for real-world applications, we need to control different aspects of the rendering process; for example, to specify the content and/or style of generated images, the camera pose, and the lighting.

    One challenge to training image generation models is the multi-modal nature of image synthesis. An image with a specific content, such as a cat or a car, can be generated with countless choices of styles (e.g., colors, lighting, and local texture details). To enable user control over the generated style, previous works train multi-modal I2I translation networks, but they suffer from complicated and slow training that is specific to one target image domain. We address this limitation and propose a style pre-training strategy that generalizes across many image domains, improves training stability and speed, and improves performance in terms of output quality and diversity.

    Another challenge to GANs and I2I translation is providing 3D control over the rendering process. For example, applications such as AR/VR, virtual tours, and telepresence require generating consistent images or videos of 3D environments. However, GANs and I2I translation mainly operate in 2D, which limits their use for such applications. To address this limitation, we propose to condition image synthesis on coarse geometric proxies (e.g., a point cloud, a coarse mesh, or a voxel grid), and we augment these rough proxies with machine-learned components that fix their artifacts and complete their missing detail to render photo-realistic images. We apply our proposal to the task of novel view synthesis under different challenging settings, and show photo-realistic novel views of complex scenes with multiple objects, tourist landmarks under different appearances, and human subjects under novel head poses and facial expressions.
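    As a rough illustration of the proxy-conditioned rendering idea described above, the sketch below refines a rasterized geometric proxy with a small convolutional network conditioned on a style code. It is a minimal sketch assuming a PyTorch setup; the module name, layer sizes, and style-code dimension are illustrative and not the architecture from the thesis.

        # Minimal sketch of the general idea (not the thesis architecture):
        # a rough proxy rendering (e.g., a rasterized point cloud or coarse mesh)
        # is refined by a learned network, conditioned on a style/appearance code,
        # toward a photo-realistic image.
        import torch
        import torch.nn as nn

        class ProxyRefinementNet(nn.Module):
            def __init__(self, proxy_channels=3, style_dim=8, hidden=64):
                super().__init__()
                # Encode the proxy rendering together with a broadcast style code.
                self.net = nn.Sequential(
                    nn.Conv2d(proxy_channels + style_dim, hidden, 3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(hidden, hidden, 3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(hidden, 3, 3, padding=1),  # RGB output
                )

            def forward(self, proxy_render, style_code):
                b, _, h, w = proxy_render.shape
                # Broadcast the per-image style code over the spatial dimensions.
                style_map = style_code.view(b, -1, 1, 1).expand(b, style_code.shape[1], h, w)
                return self.net(torch.cat([proxy_render, style_map], dim=1))

        # Usage: refine a batch of 256x256 proxy renderings with random style codes.
        net = ProxyRefinementNet()
        proxy = torch.rand(2, 3, 256, 256)   # stand-in for rasterized proxies
        style = torch.randn(2, 8)
        rgb = net(proxy, style)              # -> tensor of shape (2, 3, 256, 256)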

    RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis

    Full text link
    We present a large-scale synthetic dataset for novel view synthesis consisting of ~300k images rendered from nearly 2000 complex scenes using high-quality ray tracing at high resolution (1600 x 1600 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis, thus providing a large unified benchmark for both training and evaluation. Using 4 distinct sources of high-quality 3D meshes, the scenes of our dataset exhibit challenging variations in camera views, lighting, shape, materials, and textures. Because our dataset is too large for existing methods to process, we propose Sparse Voxel Light Field (SVLF), an efficient voxel-based light field approach for novel view synthesis that achieves comparable performance to NeRF on synthetic data, while being an order of magnitude faster to train and two orders of magnitude faster to render. SVLF achieves this speed by relying on a sparse voxel octree, careful voxel sampling (requiring only a handful of queries per ray), a reduced network structure, and ground-truth depth maps at training time. Our dataset is generated by NViSII, a Python-based ray tracing renderer, which is designed to be simple for non-experts to use and share, flexible and powerful through its use of scripting, and able to create high-quality and physically-based rendered images. Experiments with a subset of our dataset allow us to compare standard methods like NeRF and mip-NeRF for single-scene modeling, and pixelNeRF for category-level modeling, pointing toward the need for future improvements in this area.
    Comment: Project page at http://www.cs.umd.edu/~mmeshry/projects/rtm
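    The voxel-sampling idea mentioned above (only a handful of queries per ray against a sparse structure) can be sketched as follows. This is only an illustration: a Python set stands in for the sparse voxel octree, uniform ray marching stands in for octree traversal, and the function name and parameters are assumptions rather than the SVLF implementation.

        # Illustrative sketch of sparse-voxel ray sampling (not the SVLF code):
        # only occupied voxels along a ray are collected, so each ray needs
        # just a handful of feature/network lookups.
        import numpy as np

        def sample_ray_sparse(origin, direction, occupied, voxel_size=1.0,
                              t_max=10.0, max_queries=8):
            """March along the ray and collect up to `max_queries` occupied voxels."""
            direction = direction / np.linalg.norm(direction)
            hits = []
            t = 0.0
            step = voxel_size * 0.5  # coarse marching; real code traverses the octree
            while t < t_max and len(hits) < max_queries:
                p = origin + t * direction
                idx = tuple(int(v) for v in np.floor(p / voxel_size))
                if idx in occupied and (not hits or hits[-1][0] != idx):
                    hits.append((idx, t))  # voxel index and depth to query features at
                t += step
            return hits

        # Usage: a tiny sparse scene with two occupied voxels along the x-axis.
        occupied = {(2, 0, 0), (5, 0, 0)}
        ray_origin = np.array([0.0, 0.5, 0.5])
        ray_dir = np.array([1.0, 0.0, 0.0])
        print(sample_ray_sparse(ray_origin, ray_dir, occupied))
        # -> [((2, 0, 0), 2.0), ((5, 0, 0), 5.0)]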